Abstract. In this paper we investigate the problem of building a static data structure that represents a string $s$ using space close to its compressed size, and allows fast access to individual characters of $s$. This type of structure was investigated in the recent paper of Bille et al. [3]. Let $n$ be the size of a context-free grammar that derives a unique string $s$ of length $L$. (Note that $L$ might be exponential in $n$.) Bille et al. showed a data structure that uses space $O(n)$ and answers queries for the $i$-th character of $s$ in time $O(\log L)$. Their data structure works on a word RAM with a word size of $\log L$ bits.

Here we prove that for such data structures, if the space is $\mathrm{poly}(n)$, then the query time must be at least $(\log L)^{1-\varepsilon}/\log S$, where $S$ is the space used, for any constant $\varepsilon > 0$. As a function of $n$, our lower bound is $\Omega(n^{1/2-\varepsilon})$. Our proof holds in the cell-probe model with a word size of $\log L$ bits, so in particular it holds in the word RAM model. We show that no lower bound significantly better than $n^{1/2-\varepsilon}$ can be achieved in the cell-probe model, since there is a data structure in the cell-probe model that uses $O(n)$ space and achieves $O(\sqrt{n \log n})$ query time. The "bad" setting of parameters occurs roughly when $L = 2^{\sqrt{n}}$. We also prove a lower bound for the case of not-as-compressible strings, where, say, $L = n^{1+\varepsilon}$. For this case, we prove that if the space is $n \cdot \mathrm{polylog}(n)$, then the query time must be at least $\Omega(\log n/\log\log n)$.

The proof works by reduction to communication complexity, namely to the LSD (Lopsided Set Disjointness) problem, recently employed by Pătraşcu and others. We also prove lower bounds for the case of LZ-compression and Burrows-Wheeler (BWT) compression. All of our lower bounds hold even when the strings are over an alphabet of size 2 and hold even for randomized data structures with 2-sided error.
1 Introduction

In many modern databases, strings are stored in compressed form. Many compression schemes are grammar-based, in particular Lempel-Ziv [8,14,15] and its variants, as well as Run-Length Encoding. Another big family of textual compressors are BWT (Burrows-Wheeler Transform [4]) based compressors, like the one used by the software bzip2.

A natural desire is to store a text using space close to its compressed size, but to still allow fast access to
individual characters: can we do something faster than simply extracting the whole text each time we need
to access a character? This question was recently answered in the affirmative by Bille et al. [3] and by Claude
and Navarro [6]. These two works investigate the problem of storing a string that can be represented by a
small CFG (context-free grammar) of size $n$, while allowing some basic stringology operations, in particular
random access to a character in the text. The data structure of Bille et al. [3, Theorem 1] stores the text
in space linear in $n$, while allowing access to an individual character in time $O(\log L)$, where $L$ is the text's
uncompressed size. (The result of Bille et al. also allows other query operations such as pattern matching;
we do not discuss those in this paper.) But is that the best upper bound possible?
In this paper we show a $(\log L)^{1-\varepsilon}$ lower bound on the query time whenever the space used by the data
structure is $\mathrm{poly}(n)$, showing that the result of Bille et al. is close to optimal. Our lower bounds are proved in
the cell-probe model of Yao [ ], with word size $\log L$; therefore they hold in particular for the model studied
by Bille et al. [3], since the cell-probe model is strictly stronger than the RAM model. Our lower bound
is proved by a reduction to Lopsided Set Disjointness (LSD), a problem for which Pătraşcu has recently
proved an essentially tight randomized lower bound [11]. The idea is to prove that grammars are rich enough
to effectively "simulate" a disjointness query: our class of grammars, presented in Section 3.1, might be of
independent interest as a class of "hard" grammars for other purposes as well.

In terms of $n$, our lower bound is $n^{1/2-\varepsilon}$. The results of Bille et al. imply an upper bound of $O(n)$
on the query time, since $\log L \le n$; therefore in terms of $n$ there is a curious quadratic gap between our
lower bound and Bille et al.'s upper bound. We show that this gap can be closed by giving a better data
structure: we show a data structure which takes space $O(n)$ and has query time $O(\sqrt{n \log n})$, showing that
no significantly better lower bound is possible. This data structure, however, comes with a big caveat: it
runs in the highly unrealistic cell-probe model, thus serving more as an impossibility proof for lower bounds
than as a reasonable upper bound. The question remains open of whether such a data structure exists in the
more realistic word RAM model.

Our lower bound holds for a particular, "worst-case", dependence of $L$ on $n$. Namely, $L$ is roughly $2^{\sqrt{n}}$. It
might also be interesting to explicitly limit the range of allowed parameters to other regimes, for example to
text that is not highly compressible; in such a regime it might be that $L = n^{1+\varepsilon}$. The above result does not imply
any lower bound for this case. Thus, we show in another result that for any data structure in that regime,
if the space is $n \cdot \mathrm{polylog}\, n$, then the query time must be $\Omega(\log n/\log\log n)$. This lower bound holds, again,
in the cell-probe model with words of size $\log n$ bits, and is proved by a reduction to two-dimensional range
counting (which, once again, was lower bounded by a reduction to LSD [11]). To this end, we prove a new
lower bound for two-dimensional range counting on an $[n] \times [n^{\varepsilon}]$ grid, which was not previously known and is of
independent interest.

2 Preliminaries 

In this paper we denote $[m] = \{1, \dots, m\}$. All logarithms are in base 2 unless explicitly stated otherwise.

Our lower bounds are proved in Yao's cell-probe model [ ]. In the cell-probe model, the memory is seen
as an array of cells, where each cell consists of $w$ bits. The query time is measured as the number of
cells read from memory, and all computations are free. This model is strictly stronger than the word RAM,
since in the word RAM the operations allowed on words are restricted, while in the cell-probe model we
only measure the number of cells accessed. The cell-probe model is widely used in proving data structure
lower bounds, especially by reduction to communication complexity problems [ ]. In this paper we prove
our result by a reduction to the Blocked-LSD problem introduced by Pătraşcu [11].

An SLP (straight-line program) is a collection of $n$ derivation rules, defining the symbols $g_1, \dots, g_n$. Each
rule is either of the form $g_i \to$ '$\sigma$', i.e., $g_i$ is a terminal, which takes the value of a character $\sigma$ from the
underlying alphabet, or of the form $g_i \to g_j g_k$, where $j < i$ and $k < i$, i.e., $g_j$ and $g_k$ were already defined,
and we define the nonterminal symbol $g_i$ to be their concatenation. The symbol $g_n$ is the start symbol. To
derive the string we start from $g_n$ and follow the derivation rules until we get a sequence of characters from
the alphabet. The length of the derived string is at most $2^n$. W.l.o.g. we assume it is at least $n$. As
in Bille et al. [3], we also assume w.l.o.g. that the grammars are in fact SLPs, so the right-hand side
of each grammar rule consists of either exactly two variables or one terminal symbol. In this paper SLP, CFG
and grammar all mean the same thing.
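
To make the definition concrete, here is a minimal sketch of an SLP and of the naive way to extract one character from it: precompute the length of every symbol, then walk down from the start symbol. This is not the data structure of Bille et al.; the rule encoding and the names `slp_lengths` and `slp_access` are assumptions made for illustration only. The walk takes time proportional to the depth of the grammar, which can be as large as $n$.

```python
from typing import List, Tuple, Union

# A rule is either a terminal character, or a pair (j, k) of indices of
# earlier rules, meaning g_i -> g_j g_k. This encoding is illustrative only.
Rule = Union[str, Tuple[int, int]]

def slp_lengths(rules: List[Rule]) -> List[int]:
    """Length of the string derived by each symbol, computed bottom-up."""
    lengths: List[int] = []
    for i, rule in enumerate(rules):
        if isinstance(rule, str):
            lengths.append(1)
        else:
            j, k = rule
            assert j < i and k < i, "rules may only use earlier symbols"
            lengths.append(lengths[j] + lengths[k])
    return lengths

def slp_access(rules: List[Rule], lengths: List[int], i: int) -> str:
    """Return the i-th character (1-indexed) of the string derived by the last rule."""
    sym = len(rules) - 1
    assert 1 <= i <= lengths[sym]
    while not isinstance(rules[sym], str):
        j, k = rules[sym]
        if i <= lengths[j]:
            sym = j              # the character lies in the left part
        else:
            i -= lengths[j]      # skip the left part, continue on the right
            sym = k
    return rules[sym]

# Example: g0 -> '0', g1 -> '1', g2 -> g0 g1, g3 -> g2 g2, g4 -> g3 g1
rules: List[Rule] = ["0", "1", (0, 1), (2, 2), (3, 1)]
lengths = slp_lengths(rules)
derived = "".join(slp_access(rules, lengths, i) for i in range(1, lengths[-1] + 1))
```

In this toy grammar the start symbol derives a string of length 5; the lower bounds of this paper concern how much faster than such a walk a query can be.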

The grammar random access problem is the following problem. 

Definition 1 (Grammar Random Access Problem). For a CFG $G$ of size $n$ representing a binary
string of length $L$, the problem is to build a data structure to support the following query: given $1 \le i \le L$,
return the $i$-th character (bit) in the string.

We study two other data structure problems, which are closely related to their communication-complexity
counterparts.



Definition 2 (Set Disjointness, $\mathrm{SD}_N$). For a set $Y \subseteq [N]$, the problem is to build a data structure to
support the following query: given a set $X \subseteq [N]$, answer whether $X \cap Y = \emptyset$.

Given a universe $[BN]$, a set $X$ is called blocked with cardinality $N$ if, when we divide the universe $[BN]$
into $N$ equal-sized consecutive blocks, $X$ contains exactly one element from each of the blocks, while $Y$ could
be arbitrary.

Definition 3 (Blocked Lopsided Set Disjointness, $\mathrm{BLSD}_{B,N}$). For a set $Y \subseteq [BN]$, the problem is
to build a data structure to support the following query: given a blocked set $X \subseteq [BN]$ with cardinality $N$,
answer whether $X \cap Y = \emptyset$.

For proving lower bounds for near-linear space data structures, we also need a variant of the range counting
problem.

Definition 4 (Range Counting on $[n] \times [n^{\varepsilon}]$ Grid). The range counting problem is a static data structure
problem. We need to preprocess a set of $n$ points on an $[n] \times [n^{\varepsilon}]$ grid. A query $(x, y)$ asks to count the number
of points in the dominance rectangle $[1,x] \times [1,y]$, modulo 2.

When $\varepsilon = 1$, the above problem has been investigated under the name "range counting" in Pătraşcu [11].
Note that the above problem is "easier" than the classical 2D range-counting problem, since it is a dominance
problem, it is on the grid, and it is modulo 2. However, the (tight) lower bound that is known for the general
problem in Pătraşcu [11] is proven for the problem we define. In this paper, we extend this result a little
to give a lower bound on the universe $[n] \times [n^{\varepsilon}]$ for any constant $\varepsilon$.

3 Lower Bound for Grammar Random Access 

In this section we prove the main lower bound for grammar random access. In Section 3.1 we show the main 
reduction from SD and BLSD. In Section 3.2 we prove lower bounds for SD and BLSD, based on reductions 
to communication complexity (these are implicit in the work of Pătraşcu [11]). Finally, in Section 3.3 we tie
these together to get our lower bounds. 

3.1 Reduction from SD and LSD 

In this section we show how to reduce SD and BLSD to the grammar random access problem, by considering a
particular type of grammar. The reductions tie the parameters $n$ and $L$ to the parameters $B$ and $N$ of BLSD
(or just to the parameter $N$ of SD). In Section 3.3 we show how to choose the relation between the various
parameters in order to get our lower bounds. We remark that the particular multiplicative constants in the
lemmas below will not matter, but we give them nonetheless, for concreteness.

These reductions might be confusing for the reader, but they are in fact almost entirely tautological.
They just follow from the fact that the communication matrix of SD is a tensor product of the 2 by 2
communication matrices for the coordinates, i.e., it is just an $N$-fold tensor product of the matrix
$\begin{pmatrix} 1 & 1 \\ 1 & 0 \end{pmatrix}$. For BLSD, the communication matrix is the $N$-fold tensor
product of the $(2^B) \times B$ communication matrix for each block (for example, for $B = 3$ this matrix is
$\begin{pmatrix} 1&0&1&0&1&0&1&0 \\ 1&1&0&0&1&1&0&0 \\ 1&1&1&1&0&0&0&0 \end{pmatrix}$). We do not formulate our arguments
in the language of communication matrices and tensor products, since this would hide what is really going
on. To aid the reader, we give an example after each of the two constructions.

Lemma 1 (Reduction from $\mathrm{SD}_N$). For any set $Y \subseteq [N]$, there is a grammar $G_Y$ of size $n = 2N + 1$
deriving a binary string $s_Y$ of length $L = 2^N$ such that for any set $X \subseteq [N]$, it holds that $s_Y[X] = 1$ iff
$X \cap Y = \emptyset$.

Note that in this lemma we have indexed the string $s_Y$ by sets: there are $2^N$ possible sets $X$, and the
length of the string $s_Y$ is also $2^N$, so each set $X$ serves as the index of a unique character. The indexing
is done in lexicographic order: the set $X$ is identified with its characteristic vector, i.e., the vector in
$\{0,1\}^N$ whose $i$-th coordinate is '1' if $i \in X$, and '0' otherwise, and the sets are ordered according to the
lexicographic order of their characteristic vectors. For example, here is the ordering for the case $N = 3$:
$\emptyset, \{1\}, \{2\}, \{1,2\}, \{3\}, \{1,3\}, \{2,3\}, \{1,2,3\}$.

Proof. We now show how to build the grammar $G_Y$. The grammar has $N$ symbols for the strings $0, 0^2, 0^4, \dots, 0^{2^{N-1}}$,
i.e., all strings consisting solely of the character '0', whose lengths are the powers of 2 up to $2^{N-1}$. Then,
the grammar has $N + 1$ additional symbols $g_0, g_1, \dots, g_N$. The terminal $g_0$ is equal to the character 1. For
any $1 \le i \le N$, we set $g_i$ to be equal to $g_{i-1} g_{i-1}$ if $i \notin Y$, and to be equal to $g_{i-1} 0^{2^{i-1}}$ if $i \in Y$. The start
symbol of the grammar is $g_N$.

We claim that the string derived by this grammar has the property that $s_Y[X] = 1$ iff $X \cap Y = \emptyset$. This is
easy to prove by induction on $i$, where the induction claim is that for any $i$, $g_i$ is the string that corresponds
to the set $Y \cap \{1, \dots, i\}$ over the universe $\{1, \dots, i\}$. □

Example 1. Consider the universe size $N = 4$. Let $Y = \{1, 3\}$. The string $s_Y$ is 1010000010100000. The locations
of the 1's correspond exactly to the sets that do not intersect $Y$, namely to the sets $\emptyset$, $\{2\}$, $\{4\}$ and $\{2,4\}$,
respectively.
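
The construction in the proof of Lemma 1 can be checked mechanically on Example 1. The sketch below builds the rules, expands them (which is only feasible for small $N$), and compares every position with the disjointness predicate. The rule layout and the helper names `sd_grammar`, `expand` and `index_of` are assumptions made for illustration.

```python
from typing import List, Set, Tuple, Union

Rule = Union[str, Tuple[int, int]]  # terminal char, or (j, k) meaning g_j g_k

def sd_grammar(N: int, Y: Set[int]) -> List[Rule]:
    """SLP for s_Y following the proof of Lemma 1 (rule layout is illustrative)."""
    rules: List[Rule] = ["0"]            # index 0 derives 0^(2^0)
    zero = [0]                           # zero[e] = index of the symbol for 0^(2^e)
    for _ in range(1, N):
        rules.append((zero[-1], zero[-1]))
        zero.append(len(rules) - 1)
    rules.append("1")                    # g_0
    g = len(rules) - 1
    for i in range(1, N + 1):
        right = zero[i - 1] if i in Y else g
        rules.append((g, right))         # g_i = g_{i-1} 0^(2^(i-1))  or  g_{i-1} g_{i-1}
        g = len(rules) - 1
    return rules

def expand(rules: List[Rule]) -> str:
    """Derive the full string bottom-up (exponential in general; small N only)."""
    out: List[str] = []
    for r in rules:
        out.append(r if isinstance(r, str) else out[r[0]] + out[r[1]])
    return out[-1]

def index_of(X: Set[int]) -> int:
    """0-based position of the set X in the ordering described after Lemma 1."""
    return sum(1 << (i - 1) for i in X)

N, Y = 4, {1, 3}
rules = sd_grammar(N, Y)
s = expand(rules)
```

For this instance the grammar has exactly $2N + 1 = 9$ rules, and `s` is the string of Example 1.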

We now show the reduction from blocked LSD. It follows along the same general idea, but the grammar 
is slightly more complicated. 

Lemma 2 (Reduction from $\mathrm{BLSD}_{B,N}$). For any set $Y \subseteq [BN]$, there is a grammar $G_Y$ of size $n =
2BN + 1$ deriving a binary string $s_Y$ of length $L = B^N$ such that for any blocked set $X \subseteq [BN]$ of cardinality
$N$, it holds that $s_Y[X] = 1$ iff $X \cap Y = \emptyset$.

Recall that by a "blocked set $X \subseteq [BN]$ of cardinality $N$" we mean a set such that the universe $[BN]$ is
divided into $N$ equal-sized blocks, and $X$ contains exactly one element from each of these blocks.

Note that in this lemma we have again indexed the string $s_Y$ by sets: there are $B^N$ possible sets $X$ and
the length of the string is $B^N$. The indexing is done in lexicographic order, this time identifying a set $X$ with
a length-$N$ vector whose $i$-th coordinate is chosen according to which element it contains in block $i$, and
the sets are ordered according to the lexicographic order of these vectors. For example, here is the
ordering for the case $N = 2$, $B = 3$: $\{1,4\}, \{2,4\}, \{3,4\}, \{1,5\}, \{2,5\}, \{3,5\}, \{1,6\}, \{2,6\}, \{3,6\}$.

The construction in this reduction is similar to that in the case of SD, but instead of working element 
by element, we work block by block. 

Proof. We now show how to build the grammar $G_Y$. The grammar has $N$ symbols for the strings $0, 0^{B}, 0^{B^2}, \dots, 0^{B^{N-1}}$,
i.e., all strings consisting solely of the character '0', whose lengths are the powers of $B$ up to $B^{N-1}$. We
cannot simply obtain these symbols directly from each other with one rule each: e.g., to obtain $0^{B^{i+1}}$ from $0^{B^{i}}$, we need to
concatenate $0^{B^{i}}$ with itself $B$ times. Thus we use $BN$ rules to derive all of these symbols. (In fact, $O(N \log B)$
rules suffice, but this does not matter.)

Then, beyond these, the grammar has $N + 1$ additional symbols $g_0, g_1, \dots, g_N$, one for each block. The
terminal $g_0$ is equal to the character 1. For any $1 \le i \le N$, $g_i$ is constructed from $g_{i-1}$ according to which
elements of the $i$-th block are in $Y$: we set $g_i$ to be a concatenation of $B$ symbols, each of which is either
$g_{i-1}$ or $0^{B^{i-1}}$. In particular, $g_i$ is the concatenation of $g_i^{(1)}, \dots, g_i^{(B)}$, where $g_i^{(j)}$ is equal to $g_{i-1}$ if the $j$-th
element of the $i$-th block is not in $Y$, and it is equal to $0^{B^{i-1}}$ if the $j$-th element of the $i$-th block is in $Y$.
To construct these symbols we need at most $BN$ rules, because we need $B - 1$ concatenation operations to
derive $g_i$ from $g_{i-1}$. (Note that here we cannot get down to $O(N \log B)$ rules; $O(BN)$ seem to be necessary.)
The start symbol of the grammar is $g_N$.

We claim that the string produced by this grammar has the property that $s_Y[X] = 1$ iff $X \cap Y = \emptyset$.
This is easy to prove by induction on $i$, where the induction claim is that for any $i$, $g_i$ is the string that
corresponds to the set $Y \cap \{1, \dots, iB\}$ over the universe $\{1, \dots, iB\}$. □



Example 2. Consider the values $B = 3$ and $N = 3$. Let $Y = \{1, 3, 5, 9\}$. The string $s_Y$ is 010000010010000010000000000.
The locations of the 1's correspond exactly to the blocked sets that do not intersect $Y$, namely to the sets
$\{2, 4, 7\}$, $\{2, 6, 7\}$, $\{2, 4, 8\}$ and $\{2, 6, 8\}$, respectively. A brief illustration for this example is in Figure 1.

[Figure 1: $g_2$ = 010000010; $g_3$ = 010000010 010000010 000000000, where the three parts correspond to $\{7\} \cap Y = \emptyset$, $\{8\} \cap Y = \emptyset$ and $\{9\} \cap Y = \{9\}$.]

Fig. 1. An illustration of Example 2.
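
The block-by-block construction of Lemma 2 can be checked on Example 2 in the same way. The sketch below is illustrative only: the helper names `concat`, `blsd_grammar`, `expand` and `index_of` and the rule layout are assumptions, and it assumes $B \ge 2$. Concatenating $B$ symbols is done with $B - 1$ binary rules, as in the proof.

```python
from typing import List, Set, Tuple, Union

Rule = Union[str, Tuple[int, int]]  # terminal char, or (j, k) meaning g_j g_k

def concat(rules: List[Rule], parts: List[int]) -> int:
    """Append binary rules concatenating the given symbols; return the last index."""
    cur = parts[0]
    for p in parts[1:]:
        rules.append((cur, p))
        cur = len(rules) - 1
    return cur

def blsd_grammar(B: int, N: int, Y: Set[int]) -> List[Rule]:
    """SLP for s_Y following the proof of Lemma 2 (assumes B >= 2)."""
    rules: List[Rule] = ["0", "1"]
    zero, g = 0, 1                       # symbols for 0^(B^0) and for g_0
    for i in range(1, N + 1):
        block = range((i - 1) * B + 1, i * B + 1)     # elements of block i
        new_g = concat(rules, [zero if e in Y else g for e in block])
        if i < N:                        # 0^(B^i) is only needed by later blocks
            zero = concat(rules, [zero] * B)
        g = new_g
    return rules                         # the last rule is the start symbol g_N

def expand(rules: List[Rule]) -> str:
    """Derive the full string bottom-up (small instances only)."""
    out: List[str] = []
    for r in rules:
        out.append(r if isinstance(r, str) else out[r[0]] + out[r[1]])
    return out[-1]

def index_of(X: Set[int], B: int) -> int:
    """0-based position of a blocked set X in the ordering described after Lemma 2."""
    return sum(((e - 1) % B) * B ** ((e - 1) // B) for e in X)

B, N, Y = 3, 3, {1, 3, 5, 9}
rules = blsd_grammar(B, N, Y)
s = expand(rules)
```

Here `s` is the 27-character string of Example 2, and the number of rules stays below the bound $2BN + 1$ of the lemma.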



3.2 Lower Bounds for SD and BLSD

In this subsection we show lower bounds for SD and BLSD that are implicit in the work of Pătraşcu [11].
Recall the notations from Section 2: in particular, in all of the bounds, $w$, $S$, and $t$ denote the word size
(measured in bits), the size of the data structure (measured in words) and the query time (measured in
number of accesses to words), respectively.

Theorem 1. For any 2-sided-error data structure for $\mathrm{SD}_N$, $t \ge \Omega(N/(w + \log S))$.

Note that this theorem does not give strong bounds when $w = O(\log L)$, but it is meaningful for bit-probe
($w = 1$) bounds and serves as a warm-up for the reader.

Theorem 2. Let $\varepsilon > 0$ be any small constant. For any 2-sided-error data structure for $\mathrm{BLSD}_{B,N}$,

    $t \ge \min\left\{ \Omega\left(\frac{N \log B}{\log S}\right),\ \frac{B^{1-\varepsilon} N}{w} \right\}$.    (1)

The proofs follow by standard reductions from data structure to communication complexity, using known 
lower bounds for SD and BLSD (the latter is one of the main results in [11]). 
We now cite the corresponding communication complexity lower bounds: 

Lemma 3 (See [ ,12,1]). Consider the communication problem where Alice and Bob each receive a subset
of $[N]$, and they want to decide whether the sets are disjoint. Any randomized 2-sided-error protocol for this
problem uses communication $\Omega(N)$.

Lemma 4 (See [11], Lemma 3.1). Let $\varepsilon > 0$ be any small constant. Consider the communication problem
where Bob gets a subset of $[BN]$ and Alice gets a blocked subset of $[BN]$ of cardinality $N$, and they want
to decide whether the sets are disjoint. In any randomized 2-sided-error protocol for this problem, either
Alice sends $\Omega(N \log B)$ bits or Bob sends $B^{1-\varepsilon} N$ bits. (The $\Omega$-notation hides a multiplicative constant that
depends on $\varepsilon$.)

The way to prove the data structure lower bounds from the communication lower bounds is by reductions
to communication complexity: Alice and Bob execute a data structure query; Alice simulates the querier,
and Bob simulates the data structure. Alice notifies Bob which cell she would like to access; Bob returns that
cell, and they continue for $t$ rounds, which correspond to the $t$ probes. At the end of this process, Alice knows
the answer to the query. Overall, Alice sends $t \log S$ bits and Bob sends $tw$ bits. The rest is calculations,
which we include here for completeness:

Proof (Lemma 3 $\Rightarrow$ Theorem 1). We know that the players must send a total of $\Omega(N)$ bits, but the data
structure implies a protocol where $t \log S + tw$ bits are communicated. Therefore $t \log S + tw \ge \Omega(N)$, so
$t \ge \Omega(N/(\log S + w))$. □

Proof (Lemma 4 $\Rightarrow$ Theorem 2). We know that either Alice sends $\Omega(N \log B)$ bits or Bob sends $B^{1-\varepsilon} N$
bits. Therefore, either $t \log S \ge \Omega(N \log B)$ or $tw \ge B^{1-\varepsilon} N$. The conclusion follows easily. □
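
The simulation argument can be illustrated with a small sketch that wraps an arbitrary query algorithm and counts the bits exchanged: $\lceil \log S \rceil$ bits from Alice per probe and $w$ bits from Bob per probe. The function names `simulate` and `read_bit`, and the toy data structure (the string $s_Y$ of Example 1 stored verbatim in 4-bit cells), are assumptions made for illustration; they are not part of the proofs above.

```python
import math
from typing import Callable, List, Tuple

Probe = Callable[[int], int]

def simulate(memory: List[int], w: int,
             query: Callable[[Probe], int]) -> Tuple[int, int, int]:
    """Run `query` against Bob's memory while counting communicated bits.

    Returns (answer, bits sent by Alice, bits sent by Bob)."""
    addr_bits = max(1, math.ceil(math.log2(len(memory))))   # about log S
    sent = {"alice": 0, "bob": 0}

    def probe(addr: int) -> int:
        sent["alice"] += addr_bits       # Alice names the cell she wants
        sent["bob"] += w                 # Bob returns its w-bit contents
        return memory[addr]

    return query(probe), sent["alice"], sent["bob"]

# Toy instance: Bob stores the string of Example 1 in cells of w = 4 bits,
# and Alice reads one bit of it with a single probe (t = 1).
s = "1010000010100000"
w = 4
memory = [int(s[i:i + w], 2) for i in range(0, len(s), w)]

def read_bit(pos: int) -> Callable[[Probe], int]:
    """Query that returns the bit at 0-based position `pos`."""
    def query(probe: Probe) -> int:
        cell = probe(pos // w)
        return (cell >> (w - 1 - pos % w)) & 1
    return query

answer, alice_bits, bob_bits = simulate(memory, w, read_bit(10))
```

With $S = 4$ cells and $t = 1$ probe, Alice sends $t \log S = 2$ bits and Bob sends $tw = 4$ bits, matching the accounting used in the two proofs.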



3.3 Putting it Together 

We now put the results of Sections 3.1 and 3.2 together to get our lower bounds. Note that in all lower bounds
below we freely set the relation between $n$ and $L$ in any way that gives the best lower bounds. Therefore, if one is
interested in only a specific relation between $n$ and $L$ (say, $L$ polynomial in $n$), the lower bounds below are not guaranteed
to hold. The typical "worst" dependence in our lower bounds (at least for the case where $w = \log L$ and
$S = \mathrm{poly}(n)$) is roughly $L = 2^{\sqrt{n}}$.

Theorem 1 together with Lemma 1 immediately give:

Theorem 3. For any 2-sided-error data structure for the grammar random access problem, $t \ge \Omega(n/(w +
\log S))$. And in terms of $L$, $t \ge \Omega(\log L/(w + \log S))$.

When setting $w = 1$ and $S = \mathrm{poly}(n)$ (polynomial space in the bit-probe model), we get that $t \ge
\Omega(n/\log n)$. And in terms of $L$, $t \ge \Omega(\log L/\log\log L)$.

Proof. Trivial, since $n = O(N)$ and $L = 2^{\Theta(n)}$. □

Theorem 2 together with Lemma 2 give: 



Theorem 4. Assume $w = \omega(\log S)$. Let $\varepsilon > 0$ be any arbitrarily small constant. For any 2-sided-error data
structure for the grammar random access problem, $t \ge n/w^{1+\varepsilon}$. And in terms of $L$, $t \ge \frac{\log L}{\log S \cdot w^{\varepsilon/(1-\varepsilon)}}$.

When setting $w = \log L$ and $S = \mathrm{poly}(n)$ (polynomial space in the cell-probe model with cells of size
$\log L$), there is another constant $\delta$ such that we get that $t \ge n^{1/2-\delta}$. And in terms of $L$, $t \ge (\log L)^{1-\delta}$.

The condition $w = \omega(\log S)$ is a technical condition, which ensures that the value of $B$ we choose in the
proof is at least $\omega(1)$. For $w \le \log S$ one gets the best results just by reducing from SD, as in Theorem 3.

Proof. For the first part of the theorem, substitute B ~ {w/\ogSy^^^~^'' \og{w/S), N — n/B, L = B^ into 
(1). For the second part of the theorem, substitute N — —. — 2°^" , n — BN and L = B^ . And for the result, 
set (5=^. D 

4 Lower Bound for Less-Compressible Strings 

4.1 The Range Counting Lower Bound 

In the above reduction, the worst case came from strings that can be compressed superpolynomially. However,
for many of the kinds of strings we expect to encounter in practice, superpolynomial compression is
unrealistic. A more realistic range is polynomial compression or less. In this section we discuss the special
case of strings of length $\Theta(n^{1+\varepsilon})$. We show that for this class of strings, the Bille et al. [3] result is also
(almost) tight by proving an $\Omega(\log n/\log\log n)$ lower bound on the query time, when the space used is
$O(n \cdot \mathrm{polylog}\, n)$. This is done by reduction from the range counting problem on a 2D (two-dimensional) grid.
Leaving the proof to Appendix A, we have the following lower bound for the range counting problem (see
Definition 4 for details).

Lemma 5. Any data structure for the 2D range counting problem using $O(n\, \mathrm{polylog}\, n)$ space requires query
time $\Omega(\log n/\log\log n)$ in the cell-probe model with cell size $O(\log n)$.

Recall that the version of range counting we consider is actually dominance counting modulo 2 on the
$n \times n^{\varepsilon}$ grid.

The main idea behind our reduction is to consider the length-$n^{1+\varepsilon}$ binary string consisting of the answers
to all $n^{1+\varepsilon}$ possible range queries (in the natural order, i.e., row by row, and in each row from left to right);
call this the answer string of the corresponding range counting instance. We prove that this string can be
represented using a grammar of size $O(n \log n)$. The reduction then follows, since a range query can
be answered by asking for one bit of the compressed string.



Lemma 6. For any range counting problem in 2D, the answer string can be represented by a CFG of size
$O(n \log n)$.

The idea behind the proof of the lemma is to simulate a sweep of the point set from top to bottom by
a dynamic one-dimensional range tree. The symbols of the CFG will correspond to the nodes of the tree.
With each new point encountered, only $2 \log n$ new symbols have to be introduced.

Proof. Assume for simplicity that $n$ is a power of 2. It is easy to see that the answer string can be built
by concatenating the answers in row-wise order, as illustrated in Figure 2.

Fig. 2. The answer string for this instance is 0000011100010010. The values in the grid cells are the query results for queries
falling in the corresponding cell, including the bottom and left boundaries, excluding the right and top boundaries.

We are going to build the string row by row. Think of a binary tree representing the CFG built for the
first row of the input. The root of the tree derives the first row of the answer string, and its two children
respectively represent the answer string for the left and the right half of the row. In this way the tree is built
recursively. The leaves of the tree are terminal symbols in $\{0, 1\}$. Thus there are $2n - 1$ symbols in total for
the whole tree. At the same time we also maintain the negations of the symbols in the tree, i.e., we make a
new symbol $g'_i$ for each $g_i$ in the tree, where $g'_i = 1 - g_i$ if $g_i$ is a terminal symbol, or $g'_i = g'_j g'_k$ if $g_i = g_j g_k$.

The next row in the answer string will be built by changing at most $2p \log n$ symbols in the old tree,
where $p$ is the number of new points in the next row. The symbols for the new row are built by re-using
most of the symbols in the old row, and introducing new symbols where needed. We process the new points
one by one, and for each one, the modifications needed all lie on a path from a leaf to the root of the tree.
Assuming the updated path is $h_1, h_2, \dots, h_{\log n}$, the new tree will contain updated versions of $h_1, h_2, \dots, h_{\log n}$.
Also, all the right children of these nodes will be switched with their negations (this switching step does not
actually require introducing any new symbols). An intuitive picture of the process is given in Figure 3.

It is easy to see that for each new point, $2 \log n$ additional symbols are created: $\log n$ of them are the new
symbols $(h_1, \dots, h_{\log n})$, and another $\log n$ of them are their negations $(h'_1, \dots, h'_{\log n})$. In total, we use
$2n - 1 + (n - 1) \cdot 2 \log n = O(n + n \log n) = O(n \log n)$ symbols to derive the whole answer string. □
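
The sweep in the proof of Lemma 6 can be sketched as follows. Each symbol is stored together with its negation; adding a point at column $x_0$ negates all answers at positions $x \ge x_0$, which creates new symbols only along one root-to-leaf path and re-uses existing negations for the right siblings. The class and function names are assumptions made for illustration, and the point set below is our own choice, picked so that the result reproduces the answer string quoted in the caption of Fig. 2 (the points of the original figure are not recoverable from the text). The sketch returns one root per row; joining the rows into a single start symbol would cost one further rule per row.

```python
from typing import List, Tuple, Union

Rule = Union[str, Tuple[int, int]]  # terminal char, or (j, k) meaning g_j g_k

class AnswerGrammar:
    """Grammar symbols for the rows of an answer string of width n (a power of 2)."""

    def __init__(self, n: int) -> None:
        self.rules: List[Rule] = []
        self.neg: List[int] = []          # neg[i] = index of the negation of symbol i
        self.n = n
        self.root = self._build_zero(1, n)   # row with no points below it: all zeros

    def _pair(self, a: Rule, b: Rule) -> int:
        """Add a symbol and its negation; return the index of the symbol."""
        i = len(self.rules)
        self.rules += [a, b]
        self.neg += [i + 1, i]
        return i

    def _build_zero(self, lo: int, hi: int) -> int:
        if lo == hi:
            return self._pair("0", "1")
        mid = (lo + hi) // 2
        l, r = self._build_zero(lo, mid), self._build_zero(mid + 1, hi)
        return self._pair((l, r), (self.neg[l], self.neg[r]))

    def _flip(self, node: int, lo: int, hi: int, x0: int) -> int:
        """Symbol equal to `node` with all positions >= x0 negated."""
        if x0 <= lo:
            return self.neg[node]         # whole range flips: re-use the negation
        if x0 > hi:
            return node                   # nothing changes: re-use the symbol
        mid = (lo + hi) // 2
        l, r = self.rules[node]           # internal node on the update path
        l2 = self._flip(l, lo, mid, x0)
        r2 = self._flip(r, mid + 1, hi, x0)
        return self._pair((l2, r2), (self.neg[l2], self.neg[r2]))

    def add_point(self, x0: int) -> None:
        self.root = self._flip(self.root, 1, self.n, x0)

    def expand(self, node: int) -> str:
        rule = self.rules[node]
        if isinstance(rule, str):
            return rule
        return self.expand(rule[0]) + self.expand(rule[1])

def answer_rows(n: int, rows: int, points: List[Tuple[int, int]]):
    """Sweep the rows in increasing y; return the grammar and one root per row."""
    g = AnswerGrammar(n)
    roots = []
    for y in range(1, rows + 1):
        for px, py in points:
            if py == y:
                g.add_point(px)
        roots.append(g.root)
    return g, roots

def brute_force(n: int, rows: int, points: List[Tuple[int, int]]) -> str:
    """Dominance counts modulo 2, row by row, for comparison."""
    return "".join(
        str(sum(1 for px, py in points if px <= x and py <= y) % 2)
        for y in range(1, rows + 1) for x in range(1, n + 1))

n, rows = 4, 4
points = [(2, 2), (2, 3), (4, 3), (3, 4)]     # hypothetical instance
grammar, roots = answer_rows(n, rows, points)
answer_string = "".join(grammar.expand(r) for r in roots)
```

Each call to `add_point` creates at most $\log n$ new symbols plus their negations, in line with the $2 \log n$ count in the proof.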

By using the above lemma, we have the lower bound of the grammar random access problem. 

Theorem 5. Any data structure using space $O(n\, \mathrm{polylog}\, n)$ for the grammar random access problem requires
$\Omega(\log n/\log\log n)$ query time.

Proof. Given an input of the range counting problem, we compress its answer string to a CFG of size $O(n \log n)$
according to Lemma 6. After that we build a data structure for the random access problem on this CFG using
Lemma 7. For any query $(x, y)$ of the range counting problem, we simply return the character at index
$(y-1)n + x - 1$ of the answer string as the answer. By Bille et al. [3] this gives a data structure using $O(n \log n)$
space with $t(n \log n)$ query time for the range counting problem. According to Lemma 5, the lower bound for
range counting is $\Omega(\log n/\log\log n)$ for $O(n\, \mathrm{polylog}\, n)$ space; thus $t(n \log n) = \Omega(\log n/\log\log n)$, which implies $t(n) =
\Omega(\log n/\log\log n)$. □

[Figure 3: (a) The trees built for the first and second rows for the example in Figure 2. (b) The general process illustrated by a graph; the black parts stand for the negations of the corresponding symbols.]

Fig. 3. Examples for building answer strings.

Note that a natural attempt is to replace the 1D range tree that we used above by a 2D range tree and
perform a similar sweep procedure, but this does not seem to work for building higher-dimensional answer
strings.

5 Optimality 

In this section, we show that the upper bound in Bille et al. [3] is nearly optimal, for two reasons. First, it
is clear that by Theorem 5, the upper bound in Lemma 7 is nearly optimal when the space used is $O(n\, \mathrm{polylog}\, n)$.
Second, in the cell-probe model with words of size $\log L$ we also have the following lemmas, the first of which is
due to Bille et al. [3].

Lemma 7. There is a data structure for the grammar random access problem with $O(n)$ space and $O(\log L)$
time. This data structure works in the word RAM with words of size $\log L$.

Lemma 8. There is a data structure for the grammar random access problem with $O(n)$ space and $O(n \log n/\log L)$
time.

Proof. This is a trivial bound. The number of bits needed to encode the grammar is $O(n \log n)$, since each rule
needs $O(\log n)$ bits. The cell size is $O(\log L)$, so in $O(n \log n/\log L)$ time the querier can just read all of the
grammar. Since computation is free in the cell-probe model, the querier can get the answer immediately. □

Thus, we have the following corollary, by using Lemma 7 when $n = \Omega(\log^2 L/\log\log L)$ and Lemma 8
when $n = O(\log^2 L/\log\log L)$. This corollary implies that our lower bound of $\Omega(n^{1/2-\varepsilon})$ is nearly
the best one can hope for in the cell-probe model.

Corollary 1. Assuming $w = \log L$, there is a data structure in the cell-probe model with space $O(n)$ and
time $O(\sqrt{n \log n})$.
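
The balancing behind Corollary 1 can be spelled out as a short calculation. It is a routine derivation from the bounds of Lemmas 7 and 8 with $w = \log L$; the constants $c_1, c_2$ are introduced here only for the derivation.

```latex
\[
  t \;\le\; \min\Bigl\{\, c_1 \log L,\;\; c_2\,\frac{n \log n}{\log L} \Bigr\}.
\]
% Case 1: \log L \le \sqrt{n \log n}. Use Lemma 7:
\[
  t \;\le\; c_1 \log L \;\le\; c_1 \sqrt{n \log n}.
\]
% Case 2: \log L > \sqrt{n \log n}. Use Lemma 8:
\[
  t \;\le\; c_2\,\frac{n \log n}{\log L} \;<\; c_2\,\frac{n \log n}{\sqrt{n \log n}}
    \;=\; c_2 \sqrt{n \log n}.
\]
% In both cases t = O(\sqrt{n \log n}); the two bounds coincide when
% \log L = \sqrt{n \log n}.
```

The two bounds meet when $\log L = \sqrt{n \log n}$, which agrees with the "bad" regime $L \approx 2^{\sqrt{n}}$ up to the logarithmic factor in the exponent.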

6 Extensions and Variants 

In this section we discuss what the lower bound means for LZ-based compression, and ways to
extend the lower bound to BWT-based compression.



6.1 LZ-based Compression 

A typical case of grammar-based compression is Lempel-Ziv [14,8]. First let us look at LZ77. For LZ77
we have the following lemma.

Lemma 9 (Lemma 9 of [5]). The length of the LZ77 sequence for a string is a lower bound on the size of
the smallest grammar for that string.

The basic idea of this lemma is to show that each rule in the grammar contributes only one entry to the LZ77 sequence.
Since LZ77 compresses any string with a small grammar to at most the size of that grammar, it also compresses the
string $s_Y$ of Lemma 1 to at most that size. Thus the lower bound for the grammar random access problem also
holds for LZ77.

The reader might also be curious about what happens in the LZ78 [15] case. Unfortunately, the
lower bound does not hold for LZ78. This is because LZ78 is a "bad" compression scheme: even when the input
is $0^n$, consisting only of 0's, LZ78 can only compress the string to length $\sqrt{n}$. But random access on an all-0 string
trivially takes constant time with constant space. So we are not able to obtain any lower bound for this case.

There are also many other variants of LZ-based compression. As long as the compression is as efficient as
LZ77, we have the lower bound. Otherwise, if it is like LZ78, we do not.

6.2 Lower Bound for BWT-based Compression Access 

In the previous sections we discussed strings that can be compressed efficiently by grammar-based
compression schemes. An interesting question is whether the lower bound also holds if we take another
compression approach, say, BWT-based compression. We answer this question positively here.
We claim that, with a small modification, the "hard instance" used in the previous sections can be efficiently
compressed by BWT, so that our lower bound holds for BWT as well.

The BWT of a binary string $s \in \{0,1\}^L$ can be obtained by the following process. We use $s^k$ to denote
the string $s[k+1 \ldots L]\,\$\,s[1 \ldots k]$, where \$ is the end-of-string
symbol, lexicographically smaller than 0 and 1, and $s[1 \ldots k]$ is the substring consisting of the first $k$ bits of
$s$. The BWT of the string $s$ is the string formed by the last characters of the list $\{s^k\}$ for $0 \le k \le L$ after
sorting. This string is of length $L + 1$, but it will have long "runs", by which we mean maximal blocks of consecutive 1's
or 0's when omitting \$. For example, in Figure 4 there are 3 runs: '0', '111' and '00'. We call the function
defined by this process $\mathrm{bwt} : \{0,1\}^* \to \{0,1,\$\}^*$. This bwt function is invertible according to [ ], that
is, given $\mathrm{bwt}(s)$, one can always recover $s$ in linear time.



    Rotations       Sorted rotations    Last character
    010110$         $010110             0
    10110$0         0$01011             1
    0110$01         010110$             $
    110$010    =>   0110$01             1
    10$0101         10$0101             1
    0$01011         10110$0             0
    $010110         110$010             0

Fig. 4. The Burrows-Wheeler Transformation for the string $s$ = 010110. The strings on the right side are sorted in
lexicographic order. Note that '$' is the end-of-string character, which is smaller than both 0 and 1. The value of the
bwt function is the last column of the sorted list.
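
The example of Fig. 4 can be reproduced with a few lines of code. This is a naive sketch for illustration (it materializes all rotations, so it is quadratic in the length of the string); the function names `bwt` and `runs` are our own. It relies on the fact that '$' sorts before '0' and '1' in ASCII, which matches the ordering assumed in the text.

```python
from typing import List

def bwt(s: str) -> str:
    """Burrows-Wheeler transform of a binary string, with '$' as end marker."""
    t = s + "$"
    # The rotation t[k:] + t[:k] equals s[k+1..L] $ s[1..k] in the notation above.
    rotations = sorted(t[k:] + t[:k] for k in range(len(t)))
    return "".join(r[-1] for r in rotations)

def runs(b: str) -> List[str]:
    """Maximal runs of equal bits in b after omitting '$'."""
    out: List[str] = []
    for c in b.replace("$", ""):
        if out and out[-1][0] == c:
            out[-1] += c
        else:
            out.append(c)
    return out

transformed = bwt("010110")
run_list = runs(transformed)
```

For the string 010110 the transform is 01$1100, and omitting '$' leaves the three runs mentioned in the text.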

However, BWT itself is not a compression algorithm. There are several approaches to compressing the
text efficiently after BWT, e.g., [ , ]: first use MTF (move-to-front) encoding, and then arithmetic encoding.
The compressed length is $n \log L$. For binary strings, there is a much easier way to bound the number of bits
needed for storing the compressed string. We can just save the length and an extra bit indicating 0 or 1 for each
run, and $\log L$ bits for storing the position of '$'. Thus the compressed length is $n \log L$.



Lemma 10. For a binary string $s$, if $\mathrm{runs}(s) = n$, then the BWT-based compressed representation of $s$ is
less than $n \log L$ bits.

In the appendix, we will prove the following lemma for a string $s'_Y$ which is quite similar to the string
$s_Y$ used in Lemma 2.

Lemma 11 (Sketched Version of Lemma 15). There is a string $s'_Y$ constructed for a blocked set $Y$ and a
mapping $\sigma$ from subsets of $[n]$ to positions of $s'_Y$ such that for any set $X \subseteq [n]$,
$s'_Y[\sigma(X)] = (X \cap Y = \emptyset)$. The length of $s'_Y$ is $(4B)^N$, while $\mathrm{runs}(s'_Y) = O(n)$.

By using this lemma we prove the following main theorem for BWT, with details in appendix. 

Theorem 6. Consider random access to a bit of a binary string compressed by BWT-based methods into $m = n \log L$
bits. Assume $w = \omega(\log S)$. Let $\varepsilon > 0$ be an arbitrarily small constant. For any 2-sided-error data structure,

t > n/w^'^^ . And in terms of L, t > °^ e . When setting w — logL and S = poly{m) (polynomial in 

log S-w 1^^ 

m), there is a constant 6 such that we get t > m^'^~*. And in terms of L, t > (logL)^^''. 

Acknowledgement 

We thank Travis Gagie and Pawel Gawrychowski for helpful discussions. 

References 

1. L. Babai, P. Frankl, and J. Simon. Complexity classes in communication complexity theory. In FOCS, pages
337-347, 1985.

2. Z. Bar-Yossef, T. S. Jayram, R. Kumar, and D. Sivakumar. An information statistics approach to data stream and
communication complexity. Journal of Computer and System Sciences, 68(4):702-732, 2004.

3. P. Bille, G. M. Landau, R. Raman, S. Rao, K. Sadakane, and O. Weimann. Random access to grammar compressed
strings. In SODA, 2011.

4. M. Burrows and D. J. Wheeler. A block-sorting lossless data compression algorithm. Digital Systems Research
Center, 1994.

5. M. Charikar, E. Lehman, D. Liu, R. Panigrahy, M. Prabhakaran, A. Sahai, and A. Shelat. The smallest grammar
problem. IEEE Transactions on Information Theory, 51(7):2554-2576, 2005.

6. F. Claude and G. Navarro. Self-indexed text compression using straight-line programs. Mathematical Foundations
of Computer Science 2009, pages 235-246, 2009.

7. S. Deorowicz. Second step algorithms in the Burrows-Wheeler compression algorithm. Software: Practice and
Experience, 32(2):99-111, 2002.

8. A. Lempel and J. Ziv. On the complexity of finite sequences. IEEE Transactions on Information Theory, 22(1):75-81,
1976.

9. G. Manzini. An analysis of the Burrows-Wheeler transform. Journal of the ACM, 48(3):430, 2001.

10. P. B. Miltersen, N. Nisan, S. Safra, and A. Wigderson. On data structures and asymmetric communication
complexity. In STOC, page 111. ACM, 1995.

11. M. Pătrașcu. Unifying the Landscape of Cell-Probe Lower Bounds. SIAM Journal on Computing, 40(3), 2011.

12. A. A. Razborov. On the distributional complexity of disjointness. Theoretical Computer Science, 106(2):385-390,
1992.

13. A. C. C. Yao. Should tables be sorted? Journal of the ACM, 28(3):615-628, 1981.

14. J. Ziv and A. Lempel. A universal algorithm for sequential data compression. IEEE Transactions on Information
Theory, 23(3):337-343, 1977.

15. J. Ziv and A. Lempel. Compression of individual sequences via variable-rate coding. IEEE Transactions on
Information Theory, 24(5):530-536, 1978.




A Range Counting Lower Bound 

In this section we prove the lower bound for range counting (Lemma 5). Pătrașcu [11] proves the lower bound
for the range counting problem by a reduction from the reachability problem on butterfly
graphs. However, the version used in Pătrașcu's paper can only be reduced to range counting on an $[n] \times [n]$
grid. In order to make the reduction work for an $[n] \times [n^{\varepsilon}]$ grid, we need the following unbalanced butterfly
graph for the reachability problem.

Definition 5 ((Unbalanced) Butterfly Graph). A graph $G = (V, E)$ is called a butterfly graph iff there
exist integers $H, N, B, D$ such that $|V| = HN + \frac{HN}{D}$, $|E| = HNB$, $\log_B \frac{N}{D} \ge D$, and the vertices of $G$
can be labeled uniquely as $(h, b, a_1, \ldots, a_D)$, where $h \in [H]$, $b \in \{0\} \cup [D]$ (representing the layer), and
$a_1, \ldots, a_D \in [B]$. There is a directed edge $(u, v) \in E$ iff $u$ is labeled $(h, i, a_1, \ldots, a_{i-1}, a_i, a_{i+1}, \ldots, a_D)$
and $v$ is labeled $(h, i+1, a_1, \ldots, a_{i-1}, a'_i, a_{i+1}, \ldots, a_D)$.

If we merge the vertices having different $h$ labels in layer 0 of the graph together, i.e., make the vertices
labeled $(h_1, 0, a_1, \ldots, a_D)$ and $(h_2, 0, a_1, \ldots, a_D)$ the same vertex, we call the result an unbalanced butterfly graph.
All the vertices on other layers keep their unique labels. So there are $\frac{N}{D}$ vertices in layer 0 and $\frac{HN}{D}$
vertices in each of the other layers. For an example, the reader may refer to Figure 5.

Note that the condition $D \le \log_B \frac{N}{D}$ is enforced to make sure that the number of different labels satisfies
$H(D+1)B^D \le |V| = HN + \frac{HN}{D}$.
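
As a sanity check on these counts, the following sketch enumerates a small butterfly graph and its unbalanced variant. It assumes the extremal case $N = D \cdot B^D$ (so that every label is used), uses 0-based labels, and lets the edge leaving layer $i$ change the coordinate at 0-based index $i$; the function names are ours.

```python
from itertools import product


def butterfly(H: int, B: int, D: int):
    """Vertices (h, layer, a) and directed edges of H disjoint butterflies."""
    digits = list(product(range(B), repeat=D))
    vertices = {(h, b, a) for h in range(H) for b in range(D + 1) for a in digits}
    edges = set()
    for h in range(H):
        for i in range(D):                 # edges go from layer i to layer i + 1
            for a in digits:
                for x in range(B):         # new value of the coordinate that changes
                    a2 = a[:i] + (x,) + a[i + 1:]
                    edges.add(((h, i, a), (h, i + 1, a2)))
    return vertices, edges


def unbalanced(vertices, edges):
    """Merge layer-0 vertices that differ only in their h label."""
    def merge(v):
        return (0, 0, v[2]) if v[1] == 0 else v
    return {merge(v) for v in vertices}, {(merge(u), merge(v)) for u, v in edges}
```

With $H = 3$, $B = 2$, $D = 2$ (hence $N = 8$) this gives $|V| = HN + HN/D = 36$ and $|E| = HNB = 48$; after merging, layer 0 has $N/D = 4$ vertices while every other layer still has $HN/D = 12$.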





Fig. 5. From butterfly graph to unbalanced butterfly graph. 

If $H = 1$, the unbalanced butterfly graph is just the butterfly graph. Pătrașcu's paper gives a relationship
between reachability in the butterfly graph and the range counting problem [ , Section 2.1 + Appendix A].
In fact, the proof can be extended to unbalanced butterfly graphs with few modifications.
The basic idea of the proof is to map an edge $(h, i, a_1, \ldots, a_{i-1}, a_i, a_{i+1}, \ldots, a_D) \to
(h, i+1, a_1, \ldots, a_{i-1}, a'_i, a_{i+1}, \ldots, a_D)$ to a rectangle on the grid. This is because all the vertices in layer 0 that
lead to this edge are the vertices $(h, 0, *, \ldots, *, a_i, a_{i+1}, \ldots, a_D)$, and all the vertices in the last layer that this
edge leads to are the vertices $(h, D, a_1, \ldots, a'_i, *, \ldots, *)$, where $*$ means an arbitrary value. A reachability
query from layer 0 to the last layer can then be translated to a stabbing query, which is further translated
to a range counting query. We state the following lemma and leave the proof to the reader.

Lemma 12. A range counting data structure on a $P \times Q$ grid can be used to solve the reachability problem
on unbalanced butterfly graphs with $P$ vertices in layer 0 and $Q$ vertices in the last layer, with the same space and
query time. The reachability problem is to answer queries asking whether there is a directed path from vertex $u$ to vertex
$v$, where $u$ is a vertex in layer 0 and $v$ is a vertex in the last layer.

By choosing $n = HB^D$, $H = n^{1-\varepsilon}$ and $D = \Theta(\log n / \log\log n)$, we can see that the unbalanced butterfly
graph has $\frac{n}{H} = O(n^{\varepsilon})$ vertices in layer 0 and $O(n)$ vertices in the last layer. Thus a lower bound for
reachability on this graph implies a lower bound for range counting on the $[n] \times [n^{\varepsilon}]$ grid.

Lemma 13 ([ ]). For the reachability problem on the butterfly graphs with log^ jj > D, there exists some 
constant c such that when B > max < (^^J^\ , w^ L the query time t — Q[D). 

And it is easy to observe that a reachability data structure for unbalanced butterfly graphs is also a 
reachability data structure for butterfly graphs. 
Lemma 14. A reachability data structure for unbalanced butterfly graphs can be used to solve the reachability
problem for butterfly graphs using the same space and query time.

Proof. This is almost trivial. A query on vertices $(h_1, 0, a_1, \ldots, a_D)$ and $(h_2, D, a_1, \ldots, a_D)$ in
the butterfly graph is simply mapped to the vertices with the same labels in the corresponding unbalanced
butterfly graph. □

Finally, we have the proof for Lemma 5. 

Proof (Lemma 5). We know that w = O(logn) and S < n\og n for some constant d. And since D < 
logs % < logn, we can choose B = log('^+^^^+^ n to make B > {elog"^ nlogny > {^Y and B > log'^^^ n > 
w'^. It implies that t — Q{D). We can choose D — 2(d^i)c+2 ' lo °io" n — ^'^Sb j>- Thus in the best case we 
have t = J?(log n/ log log n) . D 

B Remaining Proofs for BWT-based Compressions 

Here we bound the number of "runs" in $s_Y$ (see Lemma 2 for the definition). We are going to show that
BWT-based compression schemes compress a variant of the string $s_Y$ well. As
a result, by an argument similar to that of Lemma 2, randomly accessing a bit in a BWT-compressed string is
also hard.

An important observation about the string $s_Y$ is that it can also be derived by applying the following
replacement rules to the string "1" sequentially for $i = 1, 2, \ldots, N$.

- $0 \to 0^B$;
- $1 \to h_i$;

where

$h_i = (B(i-1) \notin Y)\,(B(i-1)+1 \notin Y) \cdots (B(i-1)+B-1 \notin Y)$

is a binary string. The reader can easily check that this defines the same string as $s_Y$ in Lemma 2.

Let $s_0 = 1$ and let $s_i$ be the binary string obtained by applying the replacement rules for $i$ to $s_{i-1}$. Here
applying the replacement rules means that we simply replace every 0 in $s_{i-1}$ with $0^B$ and every 1 with $h_i$. We note
that $s_N = s_Y$. We have the following lemma for a variant of $s_Y$.
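
The replacement process can be checked on small instances with a short sketch. The conventions here are our assumptions: elements of $[BN]$ and string positions are 0-based, a blocked set is given by its digits $(a_1, \ldots, a_N)$ with one element $B(i-1) + a_i$ per block, and the function names are hypothetical.

```python
def h(i: int, B: int, Y: set) -> str:
    """h_i: bit j is 1 iff element B*(i-1)+j is not in Y, for j = 0..B-1."""
    return "".join("0" if B * (i - 1) + j in Y else "1" for j in range(B))


def build_s(B: int, N: int, Y: set) -> str:
    """Apply the rules 0 -> 0^B and 1 -> h_i to the string "1" for i = 1..N."""
    s = "1"
    for i in range(1, N + 1):
        hi = h(i, B, Y)
        s = "".join(hi if c == "1" else "0" * B for c in s)
    return s


def position(a, B: int) -> int:
    """Position of the blocked set with digits a, read as a base-B integer."""
    p = 0
    for d in a:
        p = p * B + d
    return p
```

For every digit vector `a`, the bit `build_s(B, N, Y)[position(a, B)]` is 1 exactly when the blocked set $\{B(i-1) + a_i\}$ is disjoint from $Y$: a 1 survives round $i$ only if the chosen element of block $i$ is outside $Y$, and a 0 stays 0 forever.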

Lemma 15. For every set $Y \subseteq [BN]$, let $s'_0 = 1$ and $n = BN$. For $i = 1, 2, \ldots, N$, we apply the following
replacement rules to $s'_{i-1}$ to get $s'_i$:

- $0 \to 0^{4B}$;
- $1 \to h'_i$;

where the length of $h'_i$ is $4B$ and $h'_i$ is $h_i$ with every 0 replaced by 1011 and every 1 replaced by 1101, e.g., if $h_i = 01$
then $h'_i = 10111101$.

We let $s'_Y = s'_N$. Then the following properties of $s'_Y$ hold.

1. $|s'_Y| = (4B)^N = L$;

2. There is a function $\sigma$ independent of $Y$ such that for every blocked set $X \subseteq [BN]$,
$s'_Y[\sigma(X)] = (X \cap Y = \emptyset)$;

3. $\mathrm{runs}(s'_Y) = O(n) = O(BN)$.
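
The first two properties can be verified experimentally with the sketch below. It is only an illustration under our own conventions: positions are 0-based, so the digit used for $\sigma$ is $4a_i + 1$ (the second symbol of the 4-bit block, where 1011 carries a 0 and 1101 carries a 1); this is our reading of the digit $4a_i + 2$ that appears in the proof. All function names are hypothetical.

```python
def h_prime(i: int, B: int, Y: set) -> str:
    """h'_i: h_i with every 0 replaced by 1011 and every 1 replaced by 1101."""
    return "".join("1011" if B * (i - 1) + j in Y else "1101" for j in range(B))


def build_s_prime(B: int, N: int, Y: set) -> str:
    """Apply the rules 0 -> 0^(4B) and 1 -> h'_i to the string "1" for i = 1..N."""
    s = "1"
    for i in range(1, N + 1):
        hp = h_prime(i, B, Y)
        s = "".join(hp if c == "1" else "0" * (4 * B) for c in s)
    return s


def sigma(a, B: int) -> int:
    """0-based position for digits a, read in base 4B with digit 4*a_i + 1."""
    p = 0
    for d in a:
        p = p * (4 * B) + 4 * d + 1
    return p
```

Note that `sigma` never looks at $Y$, in line with the requirement that $\sigma$ be independent of $Y$.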

First assuming this lemma is true, we have the proof for Theorem 6. 

Proof (Theorem 6). From Lemma 15 and Lemma 10 we know that one can store s'^ in a string with 
compressed size at most m — ulogL bits. Pick 6 = yz^, it is easy to verify the t > m^^^^^ lower bound. All 
the rest are similar to the proof of Theorem 4. D 
Now we prove Lemma 15.
Proof (Lemma 15). First of all, it is easy to make the following observations. 

1. Each $h'_i$ starts with 1 and ends with 1.

2. The number of consecutive 0's in $s'_i$ is $(4B)^j$ for some $j$.

3. The number of consecutive 1's in $s'_i$ is at most 4.

4. The length of $s'_i$ is $(4B)^i$.

5. If $(s'_i)^{4rB+\alpha}$ starts with $\gamma$ 0's and $\gamma$ is neither a multiple of $4B$ nor 1, then $(s'_i)^{4rB+\alpha}$ ends with 0.

6. If we represent a blocked set $X$ as an integer $(a_1, a_2, \ldots, a_N)_B$ in base $B$, then
for the integer $\sigma(X) = (4a_1 + 2, 4a_2 + 2, \ldots, 4a_N + 2)_{4B}$ in base $4B$, we have $s_Y[X] = s'_Y[\sigma(X)]$. So this
is the $\sigma$ we want.

Second, we are going to show that $\mathrm{runs}(s'_i) \le \mathrm{runs}(s'_{i-1}) + 512B$. If this is true, then
$\mathrm{runs}(s'_N) \le \mathrm{runs}(s'_{N-1}) + 512B \le \cdots \le \mathrm{runs}(s'_0) + 512BN = O(BN)$, which is the third item we need to prove.

The way to upper bound $\mathrm{runs}(s'_i)$ by $\mathrm{runs}(s'_{i-1})$ is to view the computation of the BWT of $s'_i$ as
first computing the BWT of $s'_{i-1}$ and then inserting the remaining bits. More precisely, if we group the bits of $s'_i$
into segments of length $4B$ from the start, each segment is derived from one bit of $s'_{i-1}$. A simple
observation is that, if we look only at the rotations beginning at the start of a segment, the last bits of their sorted list
form the same string as the BWT of $s'_{i-1}$. So there are $\mathrm{runs}(s'_{i-1})$ runs in the string formed by the last bits of this sorted
list.

So the remaining $\mathrm{runs}(s'_i) - \mathrm{runs}(s'_{i-1})$ runs all come from the other rotations of $s'_i$, namely $(s'_i)^{4rB+\alpha}$ for
$r \in [(4B)^{i-1}]$ and $\alpha \in [4B-1]$. We discuss them in two cases.

Case I. For s^^rS+a ^^^^^^ ^^^^ ^ ^^ q,^^ ^^ ^.^^^ ^-^^^ j^^. ^^^ ^ ^ [(45)*-i], s^^pS+a ^^^^ ^^ ^^^ ^^^^ 

bit as s^ "^ '*'". If we look at another string s^ "^ where f5 ^ a, then by the following lemma we know that 
the first 5 • AB of them differs. 

Lemma 16. The first $5 \cdot 4B$ bits of $(s'_i)^{4pB+\alpha}$ and $(s'_i)^{4qB+\beta}$ ($p, q \in [(4B)^{i-1}]$, $\alpha, \beta \in [4B-1]$) are always
different for $\alpha \ne \beta$ if the strings do not start with more than $4B$ 0's.

And by the following lemma, the last character of $(s'_i)^{4rB+\alpha}$ depends only on its first $5 \cdot 4B$ bits.

Lemma 17. If the first $5 \cdot 4B$ characters of $(s'_i)^{4pB+\alpha}$ and $(s'_i)^{4qB+\beta}$ are the same, and they do not start
with more than $4B$ 0's, then they end with the same character.

So if we group these $(s'_i)^{4rB+\alpha}$ according to their first $5 \cdot 4B$ bits, then there will
be $\le 64 \cdot 4B$ groups, and all strings in a group end with the same bit. So the number of runs added by inserting
these $(s'_i)^{4rB+\alpha}$ is $\le 2 \cdot 64 \cdot 4B = 512B$.

Case II. $(s'_i)^{4rB+\alpha}$ starts with $> 4B$ 0's. Since $\alpha \ne 0$, $(s'_i)^{4rB+\alpha}$ ends with 0. We also know that, after sorting,
all these $(s'_i)^{4rB+\alpha}$ starting with $> 4lB$ 0's and $< 4(l+1)B$ 0's will be inserted between $(s'_i)^{4pB}$ and $(s'_i)^{4qB}$ for
some $p$ and $q$, where one of them starts with $4lB$ 0's and the other starts with $4(l+1)B$ 0's. By the
following lemma and the fact that all the strings inserted here come before the strings of Case I, we know
that at least one of them ends with 0, so the number of runs increased is 0.

Lemma 18. For all possible choices of $p, q \in [(4B)^{i-1}]$, if there are $4Bl$ 0's in the prefix of $(s'_i)^{4pB}$ and
$4B(l+1)$ 0's in the prefix of $(s'_i)^{4qB}$ ($l \ge 1$), then at least one of them ends in 0.

□

Finally, we prove the remaining lemmas.
Proof (Lemma 16). We prove this by contradiction. Suppose there exist $p, q, \alpha, \beta$ such that $(s'_i)^{4pB+\alpha}$ and $(s'_i)^{4qB+\beta}$
have a common prefix longer than $5 \cdot 4B$ and $\alpha \ne \beta$.

According to the assumption, the first $20B$ bits of the two strings are all the same. However, we know that there
are $\le 4B$ 0's at the start of the two strings. So segment 3 and segment 4 must be $h'_i$, corresponding to two
1's in $s'_{i-1}$. Since $h'_i$ starts and ends with 1, segment 5 must be $h'_i$ as well. By the same
argument, segments 6 to 12 are all $h'_i$. However, there are at most 4 consecutive 1's in
$s'_{i-1}$, so this is not possible.

For all possible choices of $p \in [(4B)^{i-1}]$ and $\alpha \in [4B]$, there are only $64 \cdot 4B$ different kinds of prefixes of
length $5 \cdot 4B$ in $(s'_i)^{4pB+\alpha}$. This is because these $5 \cdot 4B$ bits are derived from at most 6 bits in $s'_{i-1}$. The number
of all possible combinations of these 6 bits is $2^6 = 64$, and $\alpha$ has $4B$ different choices, so the number of
possible prefixes is $64 \cdot 4B = 256B$. □

Proof (Lemma 17). Suppose the first $5 \cdot 4B$ characters of $(s'_i)^{4pB+\alpha}$ and $(s'_i)^{4qB+\beta}$ are the same, and they do not
start with more than $4B$ 0's.

Then by Lemma 16, we know $\alpha = \beta$.

We can also easily see that $s'_i[4pB + \alpha]$ and $s'_i[4qB + \beta]$ are generated by the same character. That is,
$s'_{i-1}[p+1]$ and $s'_{i-1}[q+1]$, which are the $(p+1)$-st and $(q+1)$-st characters of $s'_{i-1}$, must be the same.

So the characters $s'_i[4pB + \alpha]$ and $s'_i[4qB + \beta]$ are also the same; $s'_i[4pB + \alpha]$ is the last character of
$(s'_i)^{4pB+\alpha}$, and $s'_i[4qB + \beta]$ is the last character of $(s'_i)^{4qB+\beta}$. □

Proof (Lemma 18). We know that in $s'_{i-1}$, there are $l + 1$ 0's in the prefix of $(s'_{i-1})^{q}$ and $l$ 0's in the prefix
of $(s'_{i-1})^{p}$. When $l \ge 2$, we know that they must be derived from a 0 in $s'_{i-2}$. However, we know that $4B \mid l$ and
$4B \mid l + 1$ cannot hold simultaneously, so at least one of them must end with 0.

If $l = 1$, we know that $(s'_i)^{4qB}$ starts with $8B$ 0's. It must come from a 0 in $s'_{i-2}$; however, when $B > 2$ we
know that $(s'_i)^{4qB}$ ends with 0. □
Fig. 6. Alignment of $(s'_i)^{4pB+\alpha}$ and $(s'_i)^{4qB+\beta}$.